Medical insurance is an important issue in the United States, and many factors go into the coverage decisions of insurance companies. This project examines how well medical expenses can be predicted from the demographic information recorded for each individual. Additionally, because tobacco smoke is a known carcinogen, it is valuable to examine how smoking relates to the other variables in this data set.
Shiva Dumnawar. 2021. Health Insurance, Version 1. Retrieved 5/24/2024 from https://www.kaggle.com/datasets/shivadumnawar/health-insurance-dataset
| Statistic | age | bmi | children | charges |
|---|---|---|---|---|
| Min. | 18.00 | 15.96 | 0.000 | 1122 |
| 1st Qu. | 27.00 | 26.30 | 0.000 | 4740 |
| Median | 39.00 | 30.40 | 1.000 | 9382 |
| Mean | 39.21 | 30.66 | 1.095 | 13270 |
| 3rd Qu. | 51.00 | 34.69 | 2.000 | 16640 |
| Max. | 64.00 | 53.13 | 5.000 | 63770 |

sex, smoker, region: Length 1338, Class character, Mode character.
These summary statistics show that age, bmi, children, and charges each span a wide range of values across the 1,338 records. The region, sex, and smoker variables are categorical.
Because the region, sex, and smoker variables are categorical, they show up as “character” type in the summary statistics; converting them to factors and tabulating their levels tells us more about them.
| sex | n |
|---|---|
| female | 662 |
| male | 676 |

| smoker | n |
|---|---|
| no | 1064 |
| yes | 274 |

| region | n |
|---|---|
| northeast | 324 |
| northwest | 325 |
| southeast | 364 |
| southwest | 325 |
How do the medical expenses differ between smokers and non-smokers? Specifically, can the medical expense value predict whether the individual is a smoker?
Answer: While total charges for smokers and non-smokers do not differ much, the average charge per individual differs considerably, with smokers averaging much higher charges than non-smokers. This reflects the unequal group sizes in the data (274 smokers versus 1,064 non-smokers): there are far fewer smokers, but each incurs higher charges on average. A higher charge may therefore indicate that the individual is a smoker, which a regression can test.
How do medical expenses differ by region, children, age, and bmi?
Answer: By region, the patterns for total and average charges look very similar. The southeast shows the highest total and average charges; the northeast is also slightly higher than the remaining two regions, but only the southeast appears to differ meaningfully. By number of children, total charges decrease as the number of children increases, whereas average charges rise up to 3 children and then fall, staying broadly similar overall. By age, both total and average charges increase, as expected with aging. By bmi, total and average charges appear to increase up to a bmi of about 30 and then remain relatively steady.
How could the demographics of a smoker be described?
Answer: Looking at smoker status by region, age, and sex, the only region that appears to have a higher level of smoking is the southeast, with 91 smokers versus 67 and 58 in the other regions. Smoking shows a slight inverse relationship with age, with rates decreasing as age increases. Smoking does not differ much by sex, though more men than women smoke.
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| smokeryes | 23848.535 | 413.153 | 57.723 | 0.000 |
| age | 256.856 | 11.899 | 21.587 | 0.000 |
| (Intercept) | -11938.539 | 987.819 | -12.086 | 0.000 |
| bmi | 339.193 | 28.599 | 11.860 | 0.000 |
| children | 475.501 | 137.804 | 3.451 | 0.001 |
| regionsoutheast | -1035.022 | 478.692 | -2.162 | 0.031 |
| regionsouthwest | -960.051 | 477.933 | -2.009 | 0.045 |
| regionnorthwest | -352.964 | 476.276 | -0.741 | 0.459 |
| sexmale | -131.314 | 332.945 | -0.394 | 0.693 |
Compared with the JMP multiple regression, this model appears to account for more variance. That said, the neural model outperforms both regression models: on both the training and validation sets it has a higher R-squared value and a lower RASE value. This suggests that it accounts for more variance with smaller prediction error, which is ideal for a model. However, the regression shows which variables are significant in predicting charges. According to the R regression model, every variable is significant (p < 0.05) except the northwest region indicator and sex; the southwest region indicator is only marginally significant (p = 0.045). This aligns well with the graphs in Question 2.
Comparing the models by confusion matrix and R-squared values, the boosted tree model is best. All models have fairly similar accuracy and sensitivity; however, the boosted model has the highest validation R-squared, accounting for 76.59% of variability. All predictors other than region are significant in predicting whether an individual is a smoker. According to the effect summary, charges is the strongest predictor, followed by bmi and then age.
Modeling charges and smoker identified the significant predictors for each: for charges, all predictors except the northwest region indicator and sex; for smoker, all predictors except region.
---
title: "INFO 3200 Health Insurance Dashboard"
output:
flexdashboard::flex_dashboard:
vertical_layout: scroll
source_code: embed
---
<style>
.navbar {
background-color: green;
border-color:white;
}
.navbar-brand {
color:white!important;
}
</style>
```{r setup, include=FALSE, warning=FALSE}
#include=FALSE will not include r code in output
#warning=FALSE will remove any warnings from output
library(flexdashboard)
library(tidyverse)
library(GGally)
library(caret) #for logistic regression
library(broom) #for tidy() function
```
```{r load_data}
df <- read_csv("health_insurance.csv")
```
Introduction {data-orientation=rows}
=======================================================================
row {data-height=650}
-----------------------------------------------------------------------
### The Project
##### Overview
Medical insurance is an important issue in the United States, and many factors go into the coverage decisions of insurance companies. This project examines how well medical expenses can be predicted from the demographic information recorded for each individual. Additionally, because tobacco smoke is a known carcinogen, it is valuable to examine how smoking relates to the other variables in this data set.
##### Questions
1. How do the medical expenses differ between smokers and non-smokers? Specifically, can the medical
expense value predict whether the individual is a smoker?
2. How do medical expenses differ by region, children, age, and bmi?
3. How could the demographics of a smoker be described?
##### Data Source
Shiva Dumnawar. 2021. Health Insurance, Version 1. Retrieved 5/24/2024 from https://www.kaggle.com/datasets/shivadumnawar/health-insurance-dataset
### The Data
##### Description of the Variables in the Dataset
* **age**: This number represents the individual’s age.
* **sex**: Whether the individual’s gender assigned at birth is “male” or “female”.
* **bmi**: The value represents the individual’s body mass index number.
* **children**: This value is the number of children the individual has.
* **smoker**: Can either be “yes” if the individual is a smoker or “no” if the individual is not a smoker.
* **region**: The U.S. region the individual is from (southeast, southwest, northeast, or northwest).
* **charges**: This number is the value of the individual’s medical expenses.
##### Variables to Predict & Data Changes
* **charges**: Continuous
* **smoker**: Classification
* **validation**: A 60:40 validation column was created to help in prediction modeling.
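A 60:40 validation column like the one described above could be created in R roughly as follows. This is only a sketch: the seed, the toy data frame, and the labels "training" and "validation" are assumptions for illustration, not the split used for the reported models.

```r
library(dplyr)

set.seed(3200)  # assumed seed, only so the split is reproducible

# Toy stand-in for the insurance data (hypothetical values)
toy <- tibble(id = 1:10, charges = seq(1000, 10000, by = 1000))

# Draw a random permutation of the row positions; rows whose draw falls in
# the first 60% are labelled training, the remaining 40% validation
n_train <- round(0.6 * nrow(toy))
toy <- toy %>%
  mutate(validation = if_else(sample(n()) <= n_train, "training", "validation"))

count(toy, validation)
```

Labelling by a random permutation, rather than drawing each label independently, keeps the split at exactly 60:40.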
Exploration of Data {data-orientation=rows}
=======================================================================
### Summary Statistics
```{r}
#View data
summary(df)
```
These summary statistics show that age, bmi, children, and charges each span a wide range of values across the 1,338 records. The region, sex, and smoker variables are categorical.
### Categorical Variable Distributions
Because the region, sex, and smoker variables are categorical, they show up as "character" type in the summary statistics; converting them to factors and tabulating their levels tells us more about them.
```{r, cache=TRUE}
# Convert the categorical columns from character to factor
df <- mutate(df, across(c(sex, smoker, region), as.factor))
```
#### Sex (female or male)
```{r, cache=TRUE}
# Frequency of each level of sex
select(df, sex) %>% table() %>% as_tibble()
```
#### Smoker (yes or no)
```{r, cache=TRUE}
# Frequency of each level of smoker
select(df, smoker) %>% table() %>% as_tibble()
```
#### Region (southeast, southwest, northeast, or northwest)
```{r, cache=TRUE}
# Frequency of each level of region
select(df, region) %>% table() %>% as_tibble()
```
row {data-width=250}
-----------------------------------------------------------------------
#### Frequencies

Question #1 {data-orientation=rows}
=======================================================================
### Question:
*How do the medical expenses differ between smokers and non-smokers? Specifically, can the medical expense value predict whether the individual is a smoker?*
**Answer**: While total charges for smokers and non-smokers do not differ much, the average charge per individual differs considerably, with smokers averaging much higher charges than non-smokers. This reflects the unequal group sizes in the data (274 smokers versus 1,064 non-smokers): there are far fewer smokers, but each incurs higher charges on average. A higher charge may therefore indicate that the individual is a smoker, which a regression can test.
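The contrast between totals and averages can be illustrated with a short dplyr summary. The sketch below uses a small made-up data frame (the values are hypothetical, not taken from the data set) in which the two groups have equal totals but very different averages because the group sizes differ.

```r
library(dplyr)

# Toy rows (hypothetical values, not taken from the data set)
toy <- tibble(
  smoker  = c("no", "no", "no", "no", "yes"),
  charges = c(2000, 3000, 4000, 3000, 12000)
)

# Group size, total charges, and average charges per smoker status
by_smoker <- toy %>%
  group_by(smoker) %>%
  summarise(n = n(),
            total_charges = sum(charges),
            avg_charges = mean(charges))

by_smoker
```

Applied to the dashboard's `df`, the same `group_by()` and `summarise()` calls would give the totals and averages that the two charts below are meant to show.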
row {data-height=500}
-----------------------------------------------------------------------
### Total Charges by Smoker Status

row {data-height=500}
-----------------------------------------------------------------------
### Average Charges by Smoker Status

Question #2 {data-orientation=rows}
=======================================================================
### Question:
*How do medical expenses differ by region, children, age, and bmi?*
**Answer**: By region, the patterns for total and average charges look very similar. The southeast shows the highest total and average charges; the northeast is also slightly higher than the remaining two regions, but only the southeast appears to differ meaningfully. By number of children, total charges decrease as the number of children increases, whereas average charges rise up to 3 children and then fall, staying broadly similar overall. By age, both total and average charges increase, as expected with aging. By bmi, total and average charges appear to increase up to a bmi of about 30 and then remain relatively steady.
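One way to produce these comparisons in R is a single helper that summarises charges by any grouping column. The sketch below is illustrative only: `charges_by()` is a hypothetical helper, the toy values are made up, and the bmi cut point of 30 is an assumption chosen to match the pattern described above.

```r
library(dplyr)

# Hypothetical helper: total and average charges for any grouping column
charges_by <- function(data, var) {
  data %>%
    group_by({{ var }}) %>%
    summarise(total_charges = sum(charges),
              avg_charges = mean(charges),
              .groups = "drop")
}

# Toy rows (hypothetical values, not taken from the data set)
toy <- tibble(
  region  = c("southeast", "southeast", "northwest", "northwest"),
  bmi     = c(22, 35, 28, 31),
  charges = c(5000, 15000, 4000, 8000)
)

region_summary <- charges_by(toy, region)

# Continuous variables such as bmi are binned first; the cut point is assumed
bmi_summary <- toy %>%
  mutate(bmi_group = cut(bmi, breaks = c(0, 30, Inf),
                         labels = c("under 30", "30 and over"),
                         right = FALSE)) %>%
  charges_by(bmi_group)
```

Binning age or bmi before grouping keeps the number of bars manageable; the choice of bins affects how the trend appears.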
Column {data-width=250}
-----------------------------------------------------------------------
### Total & Average Charges by Region

Column {data-width=250}
-----------------------------------------------------------------------
### Total & Average Charges by Children

Column {data-width=250}
-----------------------------------------------------------------------
### Total & Average Charges by Age

Column {data-width=250}
-----------------------------------------------------------------------
### Total & Average Charges by BMI

Question #3 {data-orientation=rows}
=======================================================================
### Question:
*How could the demographics of a smoker be described?*
**Answer**: Looking at smoker status by region, age, and sex, the only region that appears to have a higher level of smoking is the southeast, with 91 smokers versus 67 and 58 in the other regions. Smoking shows a slight inverse relationship with age, with rates decreasing as age increases. Smoking does not differ much by sex, though more men than women smoke.
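Because the groups being compared are not the same size (the southeast, for example, has more individuals than the other regions), it can help to look at smoking rates alongside raw counts. The sketch below shows one way to compute both with dplyr; the toy data are hypothetical.

```r
library(dplyr)

# Toy rows (hypothetical values, not taken from the data set)
toy <- tibble(
  sex    = c("male", "male", "male", "male",
             "female", "female", "female", "female"),
  smoker = c("yes", "yes", "no", "no", "yes", "no", "no", "no")
)

# Number of smokers and share of smokers within each group
smoking_by_sex <- toy %>%
  group_by(sex) %>%
  summarise(n = n(),
            smokers = sum(smoker == "yes"),
            smoking_rate = mean(smoker == "yes"))

smoking_by_sex
```

Replacing `sex` with `region`, or with a binned age column, would give the corresponding breakdowns.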
Column {data-width=250}
-----------------------------------------------------------------------
### Smoking by Region

Column {data-width=250}
-----------------------------------------------------------------------
### Smoking by Age

Column {data-width=250}
-----------------------------------------------------------------------
### Smoking by Sex

Continuous {data-orientation=rows}
=======================================================================
```{r,include=FALSE, cache=TRUE}
#the include=FALSE hides the output - remove to see
charges_lm <- lm(charges ~ .,data = df)
summary(charges_lm)
```
```{r,include=FALSE, cache=TRUE}
#the include=FALSE hides the output - remove to see
tidy(charges_lm)
```
Column
-----------------------------------------------------------------------
### Regression Output
```{r,include=FALSE, cache=TRUE}
#knitr::kable(summary(charges_lm)$coef, digits = 3) #pretty table output
summary(charges_lm)$coef
```
```{r, cache=TRUE}
# this version sorts the p-values (it is using an index to reorder the coefficients)
idx <- order(coef(summary(charges_lm))[,4])
out <- coef(summary(charges_lm))[idx,]
knitr::kable(out, digits = 3) #pretty table output
```
Column
-----------------------------------------------------------------------
### Residual Assumptions Explorations
```{r, cache=TRUE}
plot(charges_lm, which=c(1,2)) #which tells which plots to show (1-6 different plots)
```
Row
-----------------------------------------------------------------------
### Adjusted R-Squared
```{r, cache=TRUE}
ARSq<-round(summary(charges_lm)$adj.r.squared,2)
valueBox(paste(ARSq*100,'%'))
```
### RMSE
```{r, cache=TRUE}
# sigma is the residual standard error, reported here as the RMSE estimate
Sig<-round(summary(charges_lm)$sigma,2)
valueBox(Sig)
```
Column
-----------------------------------------------------------------------

### Analysis Summary
Compared with the JMP multiple regression, this model appears to account for more variance. That said, the neural model outperforms both regression models: on both the training and validation sets it has a higher R-squared value and a lower RASE value. This suggests that it accounts for more variance with smaller prediction error, which is ideal for a model. However, the regression shows which variables are significant in predicting charges. According to the R regression model, every variable is significant (p < 0.05) except the northwest region indicator and sex; the southwest region indicator is only marginally significant (p = 0.045). This aligns well with the graphs in Question 2.
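The R-squared and RASE values used for the comparison were produced in JMP. A rough R equivalent on a held-out validation set might look like the sketch below. It assumes RASE means the square root of the average squared prediction error; the helper names, the simulated data, and the deterministic 60:40 split are all illustrative assumptions.

```r
# Hypothetical helpers; RASE is taken to be sqrt(mean((actual - predicted)^2))
rase <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
r_squared <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

# Simulated data (not the insurance data): charges rise with age plus noise
set.seed(1)
toy <- data.frame(age = rep(20:59, times = 2))
toy$charges <- 250 * toy$age + rnorm(nrow(toy), sd = 500)

# Deterministic 60:40 split: 3 of every 5 rows go to training
is_train <- seq_len(nrow(toy)) %% 5 < 3

# Fit on the training rows, then score the validation rows
fit <- lm(charges ~ age, data = toy[is_train, ])
pred <- predict(fit, newdata = toy[!is_train, ])

val_rase <- rase(toy$charges[!is_train], pred)
val_rsq <- r_squared(toy$charges[!is_train], pred)
```

Scoring on rows the model has not seen is what makes the validation R-squared and RASE comparable across different model types.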
Classification {data-orientation=rows}
=======================================================================
Row
-------------------------------------
### Confusion Matrix & Errors

Row {.tabset .tabset-fade}
-------------------------------------
### Nominal Logistic

### Decision Tree

### Boosted Tree

Row
-------------------------------------
### Analysis Summary
Comparing the models by confusion matrix and R-squared values, the boosted tree model is best. All models have fairly similar accuracy and sensitivity; however, the boosted model has the highest validation R-squared, accounting for 76.59% of variability. All predictors other than region are significant in predicting whether an individual is a smoker. According to the effect summary, charges is the strongest predictor, followed by bmi and then age.
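The classification models above were fitted in JMP. As a rough illustration of how accuracy and sensitivity follow from a confusion matrix, the sketch below fits a logistic regression in base R on simulated data. The simulated relationship, the 0.5 cutoff, and the choice of "yes" as the positive class are assumptions for illustration.

```r
# Simulated data (not the insurance data): smokers get higher charges, with overlap
set.seed(2)
n <- 200
toy <- data.frame(smoker = factor(rep(c("no", "yes"), each = n / 2)))
toy$charges <- ifelse(toy$smoker == "yes", 30000, 10000) + rnorm(n, sd = 8000)

# Logistic regression: models the probability of the second factor level ("yes")
fit <- glm(smoker ~ charges, data = toy, family = binomial)

# Classify with an assumed 0.5 probability cutoff
prob_yes <- predict(fit, type = "response")
predicted <- factor(ifelse(prob_yes > 0.5, "yes", "no"), levels = c("no", "yes"))

# Confusion matrix: rows are predictions, columns are actual classes
conf_mat <- table(predicted = predicted, actual = toy$smoker)

accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
# Sensitivity: share of actual smokers that were predicted as smokers
sensitivity <- conf_mat["yes", "yes"] / sum(conf_mat[, "yes"])
```

With the real data, the same calculations would be run on the validation rows only, so that the figures are comparable with the validation statistics quoted above.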
Conclusion
=======================================================================
### Summary
Modeling charges and smoker identified the significant predictors for each: for charges, all predictors except the northwest region indicator and sex; for smoker, all predictors except region.